Data Scientist

1000+ Data Scientist Interview Questions and Answers

Updated 5 Jul 2025

Q. For data with 1000 samples and 700 dimensions, how would you find a line that best fits the data, to be able to extrapolate? This is not a supervised ML problem; there is no target. And how would you do it, if...

Ans.

To find a line that best fits data with 1000 samples and 700 dimensions when there is no target, we can use PCA; if a target column is chosen, ordinary linear regression also works.

  • For an unsupervised approach, Principal Component Analysis (PCA) can be used: the first principal component is the direction of the line that best fits the data (equivalently, reduce the data to one dimension and use that axis as the line), as sketched below.

  • For a supervised approach, we need to select a target column. We can choose any of the 700 dimensions as the target and treat it as a regression problem.

  • Potential problems of treating this as a supervised problem include: lack of in...read more
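
A minimal sketch of the PCA idea, assuming the data is a NumPy array X of shape (1000, 700); the array here is only a random placeholder:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 700))          # placeholder for the real data

    pca = PCA(n_components=1).fit(X)
    mean = pca.mean_                          # a point on the best-fit line
    direction = pca.components_[0]            # unit vector along the line

    # Any point on the line is mean + t * direction; varying t extrapolates along it.
    t = np.linspace(-3, 3, 5)
    line_points = mean + t[:, None] * direction
    print(line_points.shape)                  # (5, 700)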


Q. Special Sum of Array Problem Statement

Given an array 'arr' containing single-digit integers, your task is to calculate the total sum of all its elements. However, the resulting sum must also be a single-digit ...read more

Ans.

Calculate the total sum of array elements until a single-digit number is obtained by repeatedly summing digits.

  • Iterate through the array and calculate the sum of all elements.

  • If the sum is a single-digit number, return it. Otherwise, repeat the process of summing digits until a single-digit number is obtained.

  • Return the final single-digit sum.
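
A minimal sketch of the repeated digit-sum approach described above:

    def special_sum(arr):
        total = sum(arr)                               # sum of all elements
        while total >= 10:                             # reduce until a single digit remains
            total = sum(int(d) for d in str(total))
        return total

    print(special_sum([9, 8, 7, 6]))                   # 9+8+7+6 = 30 -> 3+0 = 3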

Data Scientist Interview Questions and Answers for Freshers


Asked in Affine


Q. You have a pandas dataframe with three columns filled with state names, city names, and arbitrary numbers, respectively. How do you retrieve the top two cities per state based on the maximum number in the third...

Ans.

Retrieve top 2 cities per state based on max number in third column of pandas dataframe.

  • Group the dataframe by state column

  • Sort each group by the third column in descending order

  • Retrieve the top 2 rows of each group using head(2) function

  • Concatenate the resulting dataframes using pd.concat() function
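
A rough pandas sketch, assuming hypothetical columns named 'state', 'city' and 'value':

    import pandas as pd

    df = pd.DataFrame({
        'state': ['KA', 'KA', 'KA', 'MH', 'MH'],
        'city':  ['Bengaluru', 'Mysuru', 'Hubli', 'Mumbai', 'Pune'],
        'value': [90, 70, 40, 95, 85],
    })

    top2 = (df.sort_values('value', ascending=False)   # sort by the number column, descending
              .groupby('state')
              .head(2))                                # keep the top 2 rows of each state group
    print(top2.sort_values(['state', 'value'], ascending=[True, False]))

In this compact variant, groupby(...).head(2) already returns the combined result, so an explicit pd.concat() is not needed.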

Asked in Walmart


Q. Describe the data you would analyze to solve cost and revenue optimization case studies. How would you formulate the objective functions?

Ans.

For cost and revenue optimization case studies, identify the relevant data first and then formulate an objective function over it.

  • For cost optimization, look at data related to expenses, production costs, and resource allocation.

  • For revenue optimization, look at data related to sales, customer behavior, and market trends.

  • Objective function for cost optimization could be minimizing expenses while maintaining quality.

  • Objective function for revenue optimization could be maximizing profits while satisfying cus...read more
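
To make the objective-function idea concrete, here is a toy cost-minimization sketch with scipy; the products, unit costs and demand figure are all hypothetical:

    from scipy.optimize import linprog

    # Decision variables: units produced of two products A and B
    unit_cost = [4.0, 6.0]                 # objective: minimize 4*xA + 6*xB
    A_ub = [[-1.0, -1.0]]                  # constraint: xA + xB >= 100 (total demand)
    b_ub = [-100.0]

    res = linprog(c=unit_cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)                  # optimal production mix and minimum total cost

A revenue-maximization objective can be set up the same way; since linprog minimizes, revenue is maximized by minimizing its negative.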


Asked in Amazon


Q. Clone a Linked List with Random Pointers

Given a linked list where each node contains two pointers: one pointing to the next node and another random pointer that can point to any node within the list (or be nul...read more

Ans.

Create a deep copy of a linked list with random pointers.

  • Iterate through the original linked list and create a new node for each node in the list.

  • Store the mapping of original nodes to new nodes in a hashmap to handle random pointers.

  • Update the random pointers of new nodes based on the mapping stored in the hashmap.

  • Return the head of the copied linked list.
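
A minimal Python sketch of the hashmap approach described above:

    class Node:
        def __init__(self, val):
            self.val = val
            self.next = None
            self.random = None

    def clone_list(head):
        if head is None:
            return None
        mapping = {}                                   # original node -> copied node
        node = head
        while node:                                    # first pass: copy every node
            mapping[node] = Node(node.val)
            node = node.next
        node = head
        while node:                                    # second pass: wire next and random pointers
            mapping[node].next = mapping.get(node.next)
            mapping[node].random = mapping.get(node.random)
            node = node.next
        return mapping[head]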

Asked in Coforge


Q. Given a list of numbers, find the indices of two numbers that add up to a specific target value. Do this without using nested for loops. For example, given the list l = [2, 15, 5, 7] and target t = 9, the outpu...

Ans.

Find the indices of two numbers in a list whose sum equals the target, without nested for loops.

  • Use a dictionary to map each value seen so far to its index.

  • Iterate through the list once and, for each element, check whether (target minus the element) is already in the dictionary.

  • Return the indices of the two elements that add up to the target, as in the sketch below.
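
A minimal sketch of the single-pass dictionary approach:

    def two_sum(nums, target):
        seen = {}                                      # value -> index of values seen so far
        for i, x in enumerate(nums):
            if target - x in seen:                     # complement already encountered?
                return [seen[target - x], i]
            seen[x] = i
        return None

    print(two_sum([2, 15, 5, 7], 9))                   # [0, 3] because 2 + 7 == 9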


Asked in EXL Service


Q. How would you measure model effectiveness without using any confusion matrix metrics, given the data is highly imbalanced?

Ans.

One way to measure model effectiveness without using confusion matrix metrics is by using area under the receiver operating characteristic curve (AUC-ROC).

  • Calculate the AUC-ROC score to evaluate the model's ability to distinguish between positive and negative classes.

  • AUC-ROC considers the entire range of classification thresholds and is less sensitive to class imbalance than threshold-based accuracy.

  • Higher AUC-ROC score indicates better model performance.

  • Example: A model with an AUC-ROC score of 0.85 perform...read more
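
A small scikit-learn sketch of computing AUC-ROC on an imbalanced dataset (the data and model here are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]           # probability of the positive class
    print(roc_auc_score(y_te, scores))                 # threshold-free ranking metric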

Q. What is tokenization in NLP? And, to get raw tokens for a sentence with words separated by spaces, why use tokenizers from NLTK instead of str.split()?

Ans.

Tokenization in NLP is the process of breaking down text into smaller units called tokens.

  • Tokenization is a fundamental step in NLP for text preprocessing.

  • Tokens can be words, phrases, or even individual characters.

  • Tokenization helps in preparing text data for further analysis or modeling.

  • NLTK tokenizers provide additional functionalities like handling contractions, punctuation, etc.

  • str.split() may not handle complex tokenization scenarios as effectively as NLTK tokenizers.
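
A quick comparison of the two (running NLTK tokenizers may first require nltk.download('punkt')):

    from nltk.tokenize import word_tokenize

    text = "Don't split me, please!"
    print(text.split())          # ["Don't", 'split', 'me,', 'please!'] - punctuation stays glued to words
    print(word_tokenize(text))   # ['Do', "n't", 'split', 'me', ',', 'please', '!']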


Q. You have two different vectors with only a small change in one of the dimensions, but the predictions/output from the model are drastically different for each vector. Can you explain why this can be the case? I...

Ans.

A small change in one dimension producing a drastically different model output usually indicates high sensitivity to the input; possible explanations and remedies are listed below.

  • This is known as sensitivity to input

  • It can be caused by non-linearities in the model or overfitting

  • Regularization techniques can be used to reduce sensitivity

  • Cross-validation can help identify overfitting

  • Ensemble methods can help reduce sensitivity

  • It is generally a bad thing as it indicates instability in the model
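
One way to probe this kind of sensitivity is to perturb a single dimension and compare the predictions; the sketch below uses an unregularized decision tree purely as an illustration:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X, y)   # deep, unpruned tree

    x = X[0].copy()
    x_perturbed = x.copy()
    x_perturbed[3] += 1e-3                                     # tiny change in one dimension

    print(model.predict_proba([x]))
    print(model.predict_proba([x_perturbed]))                  # can differ sharply if x sits near a split threshold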

Asked in ExxonMobil


Q. In which direction does fluid flow in a vertical pipe when the pressures at two vertical locations are given?

Ans.

The flow direction in a vertical pipe is found by comparing the measured pressure difference with the hydrostatic head between the two locations.

  • For a static column of fluid there is no flow, yet the pressure at the lower location already exceeds the pressure at the upper location by ρgh.

  • If the measured pressure difference (lower minus upper) is greater than ρgh, the fluid flows upwards.

  • If the measured pressure difference is less than ρgh, the fluid flows downwards.

  • How far the pressure difference departs from the hydrostatic value, together with frictional losses, determines the flow rate (a worked example follows).
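
A worked example with assumed values (water, 10 m elevation difference, pressures in Pa):

    rho, g, h = 1000.0, 9.81, 10.0            # density (kg/m^3), gravity (m/s^2), height (m)
    hydrostatic = rho * g * h                 # 98,100 Pa for a static column
    p_lower, p_upper = 250_000.0, 120_000.0   # assumed measurements

    dp = p_lower - p_upper                    # 130,000 Pa
    if dp > hydrostatic:
        print("flow is upwards")              # this case: 130,000 > 98,100
    elif dp < hydrostatic:
        print("flow is downwards")
    else:
        print("no flow (static column)")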

Asked in Walmart


Q. How can you tune the hyperparameters of XGBoost, Random Forest, and SVM algorithms?

Ans.

Hyperparameters of XGBoost, Random Forest, and SVM can be tuned using techniques like grid search, random search, and Bayesian optimization.

  • For XGBoost, important hyperparameters to tune include learning rate, maximum depth, and number of estimators.

  • For Random Forest, important hyperparameters to tune include number of trees, maximum depth, and minimum samples split.

  • For SVM, important hyperparameters to tune include kernel type, regularization parameter, and gamma value.

  • Grid ...read more
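
A hedged grid-search sketch for one of the models (the parameter values, data and scoring choice are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    param_grid = {
        'n_estimators': [100, 300],
        'max_depth': [None, 5, 10],
        'min_samples_split': [2, 10],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='f1')
    search.fit(X, y)
    print(search.best_params_, search.best_score_)

The same pattern works for XGBoost and SVM estimators; only the estimator and the parameter grid change.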

Q. When tokenizing, if you want to avoid breaking up specific word pairs (or triplets), for example, to not tokenize the words 'first' and 'name' when they occur together and consider them as a single token, how w...

Ans.

Use NLTK's MWETokenizer to preserve specific word pairs or triplets during tokenization.

  • MWETokenizer allows you to define multi-word expressions (MWEs) that should be treated as single tokens.

  • Example: If you define MWE as ('first', 'name'), the tokenizer will keep 'first name' together.

  • You can add multiple MWEs, such as ('New', 'York') and ('data', 'science').

  • This is particularly useful in NLP tasks where context matters, like sentiment analysis.
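
A small sketch with NLTK's MWETokenizer (word_tokenize may require nltk.download('punkt') the first time):

    from nltk.tokenize import MWETokenizer, word_tokenize

    tokenizer = MWETokenizer([('first', 'name'), ('New', 'York')], separator=' ')
    tokens = tokenizer.tokenize(word_tokenize("Enter your first name and your New York address"))
    print(tokens)
    # ['Enter', 'your', 'first name', 'and', 'your', 'New York', 'address']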

Q. What is the difference between a list and a tuple? Given a = [1,2,3,4,5,6,7,8,9], what will print(a[-1:-5]) output, without running the code in an interpreter?

Ans.

The code outputs an empty list, because slicing from -1 to -5 with the default positive step selects nothing; the main difference between a list and a tuple is that a list is mutable while a tuple is immutable.

  • Lists can be modified after creation (elements appended, reassigned, or deleted), whereas tuples cannot.

  • Slicing in Python allows you to access a subset of elements in a list or tuple; the start index is inclusive and the end index is exclusive.

  • In this case, a[-1:-5] results in an empty list because the start index -1 refers to a position to the right of the end index -5 while the step is positive.
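
Verifying the slicing behaviour described above:

    a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(a[-1:-5])        # [] - the start (-1) lies to the right of the stop (-5) with a positive step
    print(a[-5:-1])        # [5, 6, 7, 8]
    print(a[-1:-5:-1])     # [9, 8, 7, 6] - a negative step walks backwards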

Asked in Affine


Q. How do you retain special characters (that pandas discards by default) in the data while reading it?

Ans.

To retain special characters in pandas data, use encoding parameter while reading the data.

  • Use encoding parameter while reading the data in pandas

  • Specify the encoding type of the data file

  • Example: pd.read_csv('filename.csv', encoding='utf-8')
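
A short self-contained sketch; the right encoding depends on how the file was written:

    import pandas as pd

    # Write a tiny file containing non-ASCII characters, then read it back
    with open('demo.csv', 'w', encoding='utf-8') as f:
        f.write('name,city\nJosé,São Paulo\n')

    df = pd.read_csv('demo.csv', encoding='utf-8')     # state the file's encoding explicitly
    print(df)
    # If the file had been written as latin-1, pass encoding='latin-1' instead.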

Asked in Rolls-Royce


Q. What are the types of ML algorithms? Give an example of each.

Ans.

There are several types of ML algorithms, including supervised learning, unsupervised learning, and reinforcement learning.

  • Supervised learning: algorithms learn from labeled data to make predictions or classifications (e.g., linear regression, decision trees)

  • Unsupervised learning: algorithms find patterns or relationships in unlabeled data (e.g., clustering, dimensionality reduction)

  • Reinforcement learning: algorithms learn through trial and error by interacting with an enviro...read more
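
A tiny example of the first two types (reinforcement learning is omitted here because it needs an environment, e.g. from the gymnasium package):

    import numpy as np
    from sklearn.cluster import KMeans                  # unsupervised: no labels
    from sklearn.linear_model import LinearRegression   # supervised: labeled data

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.0, 4.0, 6.0, 8.0])

    print(LinearRegression().fit(X, y).predict([[5.0]]))            # ~[10.]
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))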

Q. Given sample data in text format, how would you read it into Python, check for null and unique values, and create a new column by multiplying two existing features?

Ans.

Read sample data in text, check for null and unique values, create new column by multiplying two features

  • Save text data as CSV and read in Python using pandas

  • Use isnull() to check for null values

  • Use nunique() to check for unique values

  • Create a new column by multiplying two existing columns

  • Add the new column to the existing dataframe
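
A minimal sketch of those steps; the column names and inline data are hypothetical stand-ins for the sample text:

    from io import StringIO
    import pandas as pd

    raw = "feature_a,feature_b\n1,10\n2,\n2,30\n"       # stand-in for the text data saved as CSV
    df = pd.read_csv(StringIO(raw))

    print(df.isnull().sum())                            # null values per column
    print(df.nunique())                                 # unique values per column

    df['feature_product'] = df['feature_a'] * df['feature_b']   # new column = product of two features
    print(df)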

Asked in Chubb


Q. How will you get the embeddings of long sentences/paragraphs that transformer models like BERT truncate? How will you go about using BERT for such sentences? Will you use sentence embeddings or word embeddings...

Ans.

To get embeddings of long sentences/paragraphs truncated by BERT, we can use pooling techniques like mean/max pooling.

  • We can use pooling techniques like mean/max pooling to get embeddings of truncated sentences/paragraphs.

  • We can also use sliding window approach to get embeddings of overlapping segments of the long input.

  • For using BERT on such long inputs, we can use sentence embeddings or word embeddings depending on the task.

  • Models like Longformer and Reformer can handle lon...read more
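
A hedged sketch of the sliding-window plus mean-pooling idea with Hugging Face transformers (the model name, window length and stride are illustrative, and the weights are downloaded on first run):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModel.from_pretrained('bert-base-uncased')

    long_text = "Data science interview question. " * 400        # longer than 512 tokens

    enc = tokenizer(long_text, max_length=512, stride=128, truncation=True,
                    return_overflowing_tokens=True, padding=True, return_tensors='pt')

    with torch.no_grad():
        out = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])

    mask = enc['attention_mask'].unsqueeze(-1)                    # mean-pool each window over its tokens
    window_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    paragraph_embedding = window_emb.mean(0)                      # then average the windows
    print(paragraph_embedding.shape)                              # torch.Size([768])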

Asked in Feynn Labs


Q. What is the difference between Linear Regression and Logistic Regression?

Ans.

Linear Regression is used for predicting continuous numerical values, while Logistic Regression is used for predicting binary categorical values.

  • Linear Regression predicts a continuous output, while Logistic Regression predicts a binary output.

  • Linear Regression uses a linear equation to model the relationship between the independent and dependent variables, while Logistic Regression uses a logistic function.

  • Linear Regression assumes a linear relationship between the variables...read more
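
A tiny side-by-side sketch on toy data:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y_cont = np.array([1.9, 4.1, 6.0, 8.2, 9.9, 12.1])            # continuous target
    y_bin = np.array([0, 0, 0, 1, 1, 1])                          # binary target

    print(LinearRegression().fit(X, y_cont).predict([[7.0]]))        # a continuous prediction
    print(LogisticRegression().fit(X, y_bin).predict_proba([[3.5]])) # class probabilities from the logistic function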

Asked in Walmart

Q. How can you tune the hyperparameters of the XGBoost algorithm?
Ans.

Hyperparameters of XGBoost can be tuned using techniques like grid search, random search, and Bayesian optimization.

  • Use grid search to exhaustively search through a specified parameter grid

  • Utilize random search to randomly sample hyperparameters from a specified distribution

  • Apply Bayesian optimization to sequentially choose hyperparameters based on the outcomes of previous iterations
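
A short random-search sketch for XGBoost (the parameter ranges and data are illustrative; requires the xgboost package):

    from scipy.stats import randint, uniform
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    param_dist = {
        'learning_rate': uniform(0.01, 0.3),
        'max_depth': randint(3, 10),
        'n_estimators': randint(100, 500),
    }
    search = RandomizedSearchCV(XGBClassifier(eval_metric='logloss'), param_dist,
                                n_iter=20, cv=3, random_state=0)
    search.fit(X, y)
    print(search.best_params_)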

Asked in Accenture


Q. Why do we use machine learning? Machine learning is used for analysing data so that we can make predictions; with some additional algorithms it is mainly used for prediction and AI.

Ans.

Machine learning is used for data analysis and prediction, with additional algorithms applied for AI.

  • Machine learning focuses on predicting outcomes based on data analysis.

  • It involves using algorithms to learn patterns and make predictions based on new data.

  • Examples include image recognition, natural language processing, and recommendation systems.


Q. How did you prevent your model from overfitting? What did you do when it was underfit?

Ans.

To prevent overfitting, I used techniques like regularization, cross-validation, and early stopping. For underfitting, I tried increasing model complexity and adding more features.

  • Used regularization techniques like L1 and L2 regularization to penalize large weights

  • Used cross-validation to evaluate model performance on different subsets of data

  • Used early stopping to prevent the model from continuing to train when performance on validation set stops improving

  • For underfitting, ...read more
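
A brief sketch of two of those ideas, L2 regularization and early stopping (the models, data and parameters are illustrative):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

    ridge = Ridge(alpha=1.0)                                      # L2 penalty shrinks large weights
    print(cross_val_score(ridge, X, y, cv=5).mean())              # cross-validation checks generalization

    mlp = MLPRegressor(early_stopping=True, validation_fraction=0.2,
                       max_iter=2000, random_state=0)             # stops when validation score stalls
    mlp.fit(X, y)
    print(mlp.n_iter_)                                            # iterations actually used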

Asked in Turing


Q. What is the neighborhood in which superhosts have the biggest median price difference with respect to non-superhosts?

Ans.

The neighbourhood with the biggest median price difference between superhosts and non-superhosts is X.

  • Calculate the median price for superhosts and non-superhosts in each neighbourhood.

  • Find the neighbourhood with the largest difference in median prices between superhosts and non-superhosts.

  • Example: Neighbourhood X has a median price of $200 for superhosts and $150 for non-superhosts, resulting in a $50 difference.
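
A pandas sketch of the calculation, assuming hypothetical columns 'neighbourhood', 'is_superhost' and 'price':

    import pandas as pd

    listings = pd.DataFrame({
        'neighbourhood': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
        'is_superhost':  [True, False, True, True, False, False],
        'price':         [210, 150, 190, 120, 110, 100],
    })

    medians = (listings.groupby(['neighbourhood', 'is_superhost'])['price']
                       .median()
                       .unstack('is_superhost'))
    medians['diff'] = medians[True] - medians[False]   # superhost median minus non-superhost median
    print(medians['diff'].idxmax())                    # neighbourhood with the largest gap ('X' here)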

Asked in Affine


Q. How would you approach finding the number of white cars in a city?

Ans.

Estimate the number of white cars using surveys, traffic data, and image recognition techniques.

  • Conduct surveys: Ask residents about car colors in their neighborhoods.

  • Use traffic cameras: Analyze footage to count white cars during peak hours.

  • Leverage social media: Analyze posts or images of cars in the city.

  • Utilize machine learning: Train a model on images of cars to identify white ones.

  • Collaborate with local authorities: Access registration data for car colors.

Asked in Citicorp


Q. Which test is used in logistic regression to check the significance of the variable?

Ans.

The Wald test is used in logistic regression to check the significance of the variable.

  • The Wald test calculates the ratio of the estimated coefficient to its standard error.

  • It follows a chi-square distribution with one degree of freedom.

  • A small p-value indicates that the variable is significant.

  • For example, in Python, the statsmodels library provides the Wald test in the summary of a logistic regression model.
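
A short statsmodels sketch on synthetic data; the z column in the summary is the Wald statistic (each coefficient divided by its standard error):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)    # synthetic binary outcome

    result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(result.summary())        # per-coefficient z values and p-values
    print(result.pvalues)          # a small p-value indicates a significant variable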

Asked in Nielsen


Q. Write pandas query to separate the names as first and last name from the full name. Drop the duplicate columns and also the missing values. Write output for the Python code. Write SQL query to retrieve the name...

Ans.

This answer covers the related data science concepts asked alongside the coding task: the difference between recall and precision, and techniques for reducing variance in ensemble models.

  • Recall is the ratio of correctly predicted positive observations to the total actual positives. Precision is the ratio of correctly predicted positive observations to the total predicted positives.

  • To reduce variance in an ensemble model, techniques like bagging, boosting, and stacking can be used. Bagging involves training multiple models on different subsets of the data and averaging their predictions. Boost...read more
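
For the pandas part of the question, a hedged sketch of splitting full names and dropping duplicates and missing values (the column names are assumed):

    import pandas as pd

    df = pd.DataFrame({'full_name': ['Asha Rao', 'Vikram Singh', 'Asha Rao', None]})

    df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
    df = df.drop_duplicates().dropna()      # drop duplicate rows and rows with missing values
    print(df)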

Asked in Walmart


Q. What do these hyperparameters in the above-mentioned algorithms actually mean?

Ans.

Hyperparameters are settings that control the behavior of machine learning algorithms.

  • Hyperparameters are set before training the model.

  • They control the learning process and affect the model's performance.

  • Examples include learning rate, regularization strength, and number of hidden layers.

  • Optimizing hyperparameters is important for achieving better model accuracy.

Asked in MasterCard


Q. How do you deal with senior customers when you don't have enough data?

Ans.

Communicate transparently and offer alternative solutions.

  • Explain the limitations of the available data and the potential risks of making decisions based on incomplete information.

  • Offer alternative solutions that can be implemented with the available data.

  • Collaborate with the customer to identify additional data sources or explore other options to gather more data.

  • Provide regular updates on the progress of data collection and analysis.

  • Ensure that all decisions are based on so...read more

Asked in Walmart

Q. Can you explain the hyperparameters in the XGBoost algorithm?
Ans.

Hyperparameters in XGBoost algorithm control the behavior of the model during training.

  • Hyperparameters include parameters like learning rate, max depth, number of trees, etc.

  • They are set before the training process and can greatly impact the model's performance.

  • Example: 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100
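
A tiny sketch showing those example values passed to the model (requires the xgboost package):

    from xgboost import XGBClassifier

    model = XGBClassifier(
        learning_rate=0.1,     # shrinkage applied to each boosting round
        max_depth=5,           # maximum depth of each tree
        n_estimators=100,      # number of boosting rounds (trees)
    )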

Asked in Affine


Q. How will the resultant table be when you merge two tables that match on a column, and the second table has many repeated keys?

Ans.

The resultant table will have all the columns from both tables and the rows will be a combination of matching rows.

  • The resultant table will have all the columns from both tables

  • The rows in the resultant table will be a combination of matching rows

  • If the second table has repeated keys, there will be multiple rows with the same key in the resultant table
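
A small demonstration of how repeated keys in the second table multiply rows in the merge:

    import pandas as pd

    left = pd.DataFrame({'key': ['a', 'b'], 'left_val': [1, 2]})
    right = pd.DataFrame({'key': ['a', 'a', 'b'], 'right_val': [10, 20, 30]})

    merged = left.merge(right, on='key')    # inner join on the matching column
    print(merged)
    # key 'a' appears twice in the right table, so the result has two rows for 'a'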

Asked in Axtria

Q. What is the difference between Ridge and LASSO regression?
Ans.

Ridge and LASSO regression are both regularization techniques used in linear regression to prevent overfitting by adding penalty terms to the cost function.

  • Ridge regression adds a penalty term equivalent to the square of the magnitude of coefficients (L2 regularization).

  • LASSO regression adds a penalty term equivalent to the absolute value of the magnitude of coefficients (L1 regularization).

  • Ridge regression tends to shrink the coefficients towards zero but does not set them e...read more
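
A quick sketch contrasting the two penalties (the alpha values and data are illustrative):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=1.0).fit(X, y)

    print(np.sum(ridge.coef_ == 0))   # Ridge shrinks coefficients but rarely makes them exactly zero
    print(np.sum(lasso.coef_ == 0))   # Lasso drives some coefficients exactly to zero (feature selection)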

